Created at 2018-1-4 14:57:28 Updated at 2018-1-20 18:49:11 Tomáš Hübelbauer

Modern Office Git Diff

An experiment in tracking and diffing versions of modern Microsoft Office files in Git.

Modern Office file formats are ZIP archives with XML files in them. The ZIP archives are binary files, so Git (and, furthermore, GitHub and GitLab, where the diff cannot be tweaked) won't display a useful diff for them. The XML files are not binary, so to display a diff for them, this project unpacks the ZIP files into directories that are tracked in Git. Tracking generated files is pretty dumb, but so is tracking binary files, and when forced to do one, it's not a leap to do the other as well if it brings something useful to the table.

This is achieved using a PowerShell script which unpacks the ZIP file to a tracked directory, formats the XML files for a readable diff and tracks the formatted files as well.
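The unpack-and-format step described above can be sketched as follows. This is not the project's PowerShell script; it is a minimal Python illustration of the same idea, and the function name `unpack_and_format`, the target directory layout and the choice to treat `.xml` and `.rels` parts as XML are all assumptions made for the example.

```python
import zipfile
from pathlib import Path
from typing import List
from xml.dom import minidom


def unpack_and_format(office_file: Path, target_dir: Path) -> List[Path]:
    """Extract an Office file (a ZIP archive) and pretty-print its XML parts.

    Hypothetical helper for illustration only. Returns the paths of the
    XML files that were rewritten with indentation.
    """
    formatted = []
    with zipfile.ZipFile(office_file) as archive:
        archive.extractall(target_dir)
        names = archive.namelist()
    for name in names:
        path = target_dir / name
        # Assumption: XML lives in .xml and .rels parts. Everything else
        # (images, fonts, directory entries) stays exactly as extracted.
        if path.suffix not in {".xml", ".rels"}:
            continue
        document = minidom.parse(str(path))
        # One element per line makes line-based diffs far more readable.
        path.write_bytes(document.toprettyxml(indent="  ", encoding="utf-8"))
        formatted.append(path)
    return formatted
```

The indented copy is meant for diffing only: pretty-printing can alter whitespace in mixed content, so the original archive remains the source of truth.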

Examples:

The XML diff captures the exact change whereas the TXT diff captures text-only change for quick content inspection.
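A text-only view of an XML part might be produced along these lines. This is a hedged Python sketch rather than the project's own implementation; the function name `text_only` and the rule of emitting one non-empty text run per line are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET


def text_only(xml_bytes: bytes) -> str:
    """Return the text content of an XML document, one non-empty run per line.

    Hypothetical helper: all markup (elements, attributes) is dropped, so the
    result is lossy and only suitable for quick content inspection.
    """
    root = ET.fromstring(xml_bytes)
    # itertext() walks the tree in document order and yields every text node.
    runs = (chunk.strip() for chunk in root.itertext())
    return "\n".join(run for run in runs if run)
```

Writing this output next to the formatted XML gives two diffs per change: the XML diff shows what changed structurally, while the text diff shows only the words that changed.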

Features:

Limitations:

Support:

Running

Run PowerShell scripts using the VS Code PowerShell Integrated Console to avoid security blocks. Open it by clicking on any .ps1 file with the integrated terminal open, or by running the VS Code command PowerShell: Show Integrated Console (F1+(p+s+c+i)).

cp .git/hooks/pre-commit.sample .git/hooks/pre-commit
code .git/hooks/pre-commit

Observe commit diffs to see Office file changes in the XML and TXT files.

Testing

Run PowerShell scripts using the VS Code PowerShell Integrated Console to avoid security blocks. Open it by clicking on any .ps1 file with the integrated terminal open, or by running the VS Code command PowerShell: Show Integrated Console (F1+(p+s+c+i)).

Run cmd/run-tests.ps1, which runs the Node.js tests in test/ (prerequisites).

In this repository, the tests run together with the main script in a pre-commit hook in order to catch bugs as early as possible during development. When using this script as a tool in a repository other than this one, only the main script would be run, as shown in the Git pre-commit hook setup code.

Continuous Integration

Even though running the test suite as part of the development pre-commit hook is recommended, that won't cover contributions made through the GitHub/GitLab online editors.

See Contributing for details about repository hosting and mirroring.

GitHub

See tasks.

GitLab

See .gitlab-ci.yml for configuration. Debugging the configuration is cumbersome because the GitLab repository is set up as a pull mirror, which means that introducing changes in it directly stops the pulling from happening. The configuration can be changed either in the GitHub repository, with validation confirmed once another pull happens and the pipeline runs, or by setting up a temporary clone in GitLab, tweaking the configuration there and then porting the working changes over to the GitHub source repository.

Licensing

This repository is licensed under the MIT license.

Contributing

The project is hosted on GitHub and is mirrored to GitLab using 'pull' repository mirroring.

Use hook/pre-commit-development.sh when contributing to this repository to also run tests.

See planned development.

Studying

See git log and development notes.

Some notable prior art:

All of these focus on generating text-only versions of the files on demand (non-tracked) and do not capture structural changes. This project aims to explore the other, potentially less useful but nonetheless interesting, route of versioning both the compressed and the uncompressed forms of a file in parallel. See features and drawbacks for pros and cons.

Changes (144)
2018-1-20 18:49:11 Tomáš Hübelbauer
Turn into a bloggo
2018-1-9 13:09:59 Tomáš Hübelbauer
Add a task to fix GitLab CI failing
2018-1-9 13:08:44 Tomáš Hübelbauer
Add Git installation and fix NVM installation
2018-1-8 08:48:07 Tomáš Hübelbauer
Add GitLab CI debugging hint
2018-1-8 08:46:11 Tomáš Hübelbauer
Remove sudo as the image seems to be lacking it
2018-1-7 20:47:56 Tomáš Hübelbauer
Add CI badge on top as well
2018-1-7 20:46:34 Tomáš Hübelbauer
Split up and update CI tasks
2018-1-7 20:42:09 Tomáš Hübelbauer
Display GitLab CI pipeline status
2018-1-7 20:40:23 Tomáš Hübelbauer
Update GitLab CI script to install deps
2018-1-7 11:42:58 Tomáš Hübelbauer
Update dev log
2018-1-7 11:41:56 Tomáš Hübelbauer
Add a task to fix comitting through a GUI
2018-1-7 11:38:48 Tomáš Hübelbauer
Use pwsh in Ubuntu based Docker image
2018-1-5 10:17:47 Tomáš Hübelbauer
Try to fix GitLab CI permission error
2018-1-5 09:16:19 Tomáš Hübelbauer
Set up GitLab CI
2018-1-4 15:12:03 Tomáš Hübelbauer
Document mirroring and CI
2018-1-4 14:52:36 Tomáš Hübelbauer
Finish all planned tests
2018-1-4 12:05:41 Tomáš Hübelbauer
Stub test files for suggested tests
2018-1-4 11:58:06 Tomáš Hübelbauer
Fix hook links and run test in the pre-commit hook
2018-1-4 10:51:30 Tomáš Hübelbauer
Add a task to fix MarkDown syntax
2018-1-4 10:50:22 Tomáš Hübelbauer
Differentiate pre-commit hooks for tests
2018-1-4 10:06:17 Tomáš Hübelbauer
Add the idea with CI tests
2018-1-4 10:05:50 Tomáš Hübelbauer
Add a task to runs tests with each commit
2018-1-4 10:04:42 Tomáš Hübelbauer
Add descriptions to test ideas
2018-1-4 10:00:29 Tomáš Hübelbauer
Pull test util functions out to enable more test types
2018-1-3 16:20:13 Tomáš Hübelbauer
Report all run results at the end of test run
2018-1-3 16:17:31 Tomáš Hübelbauer
Add a basic PowerPoint test
2018-1-3 16:02:52 Tomáš Hübelbauer
Add basic Excel test
2018-1-3 14:31:40 Tomáš Hübelbauer
Add full shortcut for the PowerShell Integrated Console
2018-1-3 11:10:31 Tomáš Hübelbauer
Update pre-commit hook code in README
2018-1-3 11:08:07 Tomáš Hübelbauer
Distinguish powershell and pwsh commands
2018-1-2 19:53:28 Tomáš Hübelbauer
Add a task to distinguish powershell and pwsh
2018-1-2 19:52:51 Tomáš Hübelbauer
Verify PowerShell works on Ubuntu
2018-1-2 18:49:41 Tomáš Hübelbauer
Document Git prerequisite for tests
2018-1-2 18:36:55 Tomáš Hübelbauer
Remove finished license task
2018-1-2 16:18:52 Tomáš Hübelbauer
Link the license file from the README
2018-1-2 16:18:25 Tomáš Hübelbauer
Add licensing information
2018-1-2 16:17:54 Tomáš Hübelbauer
Create LICENSE.md
2018-1-2 16:16:13 Tomáš Hübelbauer
Dot dot back to root repository directory after running tests
2018-1-2 16:15:34 Tomáš Hübelbauer
Add instructions for running tests
2018-1-2 16:13:37 Tomáš Hübelbauer
Implement basis for writing tests
2018-1-2 14:27:23 Tomáš Hübelbauer
Add supported system version table
2018-1-2 12:54:46 Tomáš Hübelbauer
Rethink and document the approach to tests
2018-1-1 22:14:58 Tomáš Hübelbauer
Explain task for adding tests
2018-1-1 22:13:04 Tomáš Hübelbauer
Update dev log for today
2018-1-1 22:08:47 Tomáš Hübelbauer
Update features and limitations
2018-1-1 22:01:37 Tomáš Hübelbauer
Return temporarily disabled change check
2018-1-1 22:01:10 Tomáš Hübelbauer
Generate 'generated' comments
2018-1-1 21:50:46 Tomáš Hübelbauer
Add generated content warning files and comments task
2018-1-1 21:46:53 Tomáš Hübelbauer
Clarify how to open PowerShell Integrated Console in VS Code
2018-1-1 21:44:25 Tomáš Hübelbauer
Add a task to verify portability
2018-1-1 21:42:03 Tomáš Hübelbauer
Simplify README.md and add PS ISE script
2018-1-1 21:33:06 Tomáš Hübelbauer
Fix missing staged changed files
2018-1-1 21:30:39 Tomáš Hübelbauer
Added tasks for tests and PS ISE
2018-1-1 21:25:35 Tomáš Hübelbauer
Scrape block element recognition attempt
2018-1-1 21:24:02 Tomáš Hübelbauer
Update tasks to scrape block elements
2018-1-1 21:18:39 Tomáš Hübelbauer
Add a task to fix skipping
2018-1-1 21:16:53 Tomáš Hübelbauer
Add title element to Word demo
2018-1-1 21:16:08 Tomáš Hübelbauer
Update tasks and notes
2018-1-1 21:14:42 Tomáš Hübelbauer
Draft surrounding block text nodes with blank lines
2018-1-1 20:59:55 Tomáš Hübelbauer
Update dev log
2018-1-1 20:58:13 Tomáš Hübelbauer
Expand planned development tasks
2018-1-1 20:43:27 Tomáš Hübelbauer
Remove completed task to skip unchanged files
2018-1-1 20:42:53 Tomáš Hübelbauer
Implement skipping unchanged files
2018-1-1 20:32:48 Tomáš Hübelbauer
Move demo files to own folder
2018-1-1 20:25:31 Tomáš Hübelbauer
Add a new drawback
2018-1-1 20:23:58 Tomáš Hübelbauer
Sort priot art by year
2017-12-31 13:05:43 Tomáš Hübelbauer
Improve Running section paragraph structure
2017-12-31 12:42:07 Tomáš Hübelbauer
Credit prior art
2017-12-31 12:31:40 Tomáš Hübelbauer
Update project tagline
2017-12-31 12:30:13 Tomáš Hübelbauer
Clarify running the script without security error
2017-12-31 12:29:05 Tomáš Hübelbauer
Update project name
2017-12-31 12:25:20 Tomáš Hübelbauer
Link to example diff commits
2017-12-31 12:23:00 Tomáš Hübelbauer
Demonstrate Excel diffing
2017-12-31 12:22:34 Tomáš Hübelbauer
Demonstrate Word diffing
2017-12-31 12:21:43 Tomáš Hübelbauer
Encode TXT diff files as UTF8
2017-12-31 12:19:21 Tomáš Hübelbauer
Demonstrate Excel diffing
2017-12-31 12:18:45 Tomáš Hübelbauer
Demonstrate Word diffing
2017-12-31 12:17:41 Tomáš Hübelbauer
Call out main feature
2017-12-31 12:16:33 Tomáš Hübelbauer
Add new tasks
2017-12-31 12:14:46 Tomáš Hübelbauer
Call out features in the README
2017-12-31 12:12:12 Tomáš Hübelbauer
Clear task backlog
2017-12-31 12:11:38 Tomáš Hübelbauer
Generate text-only files for lossy diff
2017-12-31 11:55:59 Tomáš Hübelbauer
Remove completed task
2017-12-31 11:55:37 Tomáš Hübelbauer
Reword
2017-12-31 11:51:12 Tomáš Hübelbauer
Improve invocation example
2017-12-31 11:47:42 Tomáš Hübelbauer
Clean up README.md and tasks
2017-12-31 11:41:11 Tomáš Hübelbauer
Remove PowerPoint test
2017-12-31 11:38:59 Tomáš Hübelbauer
Test tracking whole .git directory not just XML files
2017-12-31 11:38:23 Tomáš Hübelbauer
Remove artifact
2017-12-31 11:37:50 Tomáš Hübelbauer
Fix directory path
2017-12-31 11:37:25 Tomáš Hübelbauer
Add new PowerPoint to test adding non-XML
2017-12-31 11:36:43 Tomáš Hübelbauer
Remove abandoned without asking
2017-12-31 11:35:50 Tomáš Hübelbauer
Improve tracking and test abandoned directory cleanup
2017-12-31 11:34:48 Tomáš Hübelbauer
Add new tasks
2017-12-31 11:33:01 Tomáš Hübelbauer
Commit non-XML files
2017-12-31 11:32:28 Tomáš Hübelbauer
Implement disposing abandoned extractions
2017-12-31 09:37:35 Tomáš Hübelbauer
Update the Office files for diff
2017-12-31 09:36:31 Tomáš Hübelbauer
Make the Office files blank
2017-12-31 09:34:39 Tomáš Hübelbauer
Clean up the README.md file
2017-12-31 09:21:50 Tomáš Hübelbauer
Make more changes
2017-12-31 09:20:19 Tomáš Hübelbauer
Make a change to the docs
2017-12-31 09:18:48 Tomáš Hübelbauer
Use new docs and update tasks
2017-12-31 09:16:05 Tomáš Hübelbauer
Extend to cover any Office files
2017-12-31 09:01:33 Tomáš Hübelbauer
Fix typos
2017-12-31 08:59:00 Tomáš Hübelbauer
Fix project title
2017-12-31 08:58:25 Tomáš Hübelbauer
Do one more change for the shits and giggles
2017-12-31 08:56:42 Tomáš Hübelbauer
Do a real last change for a test
2017-12-31 08:55:41 Tomáš Hübelbauer
Add instructions for setting up the pre-commit hook
2017-12-31 08:54:04 Tomáš Hübelbauer
Do one last test
2017-12-31 08:52:59 Tomáš Hübelbauer
Add planned contributions and development log
2017-12-31 08:46:16 Tomáš Hübelbauer
Make more changes
2017-12-31 08:45:46 Tomáš Hübelbauer
Fix git add absolute path
2017-12-31 08:43:25 Tomáš Hübelbauer
Make other changes
2017-12-31 08:42:44 Tomáš Hübelbauer
Track changed generated files instead of waiting
2017-12-31 08:40:19 Tomáš Hübelbauer
Make more changes
2017-12-31 08:39:36 Tomáš Hübelbauer
Increase dangling file timeout to 5s
2017-12-31 08:38:52 Tomáš Hübelbauer
Add dry run notes and dangling files
2017-12-31 08:37:13 Tomáš Hübelbauer
Make some changes
2017-12-31 08:33:56 Tomáš Hübelbauer
Add dangling file protection
2017-12-31 08:32:53 Tomáš Hübelbauer
Add try-this-out instructions
2017-12-31 08:30:09 Tomáš Hübelbauer
Use literal path to fix loading error
2017-12-31 08:23:02 Tomáš Hübelbauer
Add dangling files and try to format XML
2017-12-31 08:19:06 Tomáš Hübelbauer
Make changes to the DOCX file for diff
2017-12-31 08:17:59 Tomáš Hübelbauer
Force expansion to overwrite and remove Done message
2017-12-31 08:16:57 Tomáš Hübelbauer
Commit expanded archive for diffs
2017-12-31 08:16:06 Tomáš Hübelbauer
Use temporary copy with ZIP extension
2017-12-31 08:14:12 Tomáš Hübelbauer
Add ZIP expansion to PowerShell
2017-12-31 08:13:01 Tomáš Hübelbauer
Update doc with latest findings
2017-12-31 08:09:49 Tomáš Hübelbauer
Try WSL with full path
2017-12-31 08:07:06 Tomáš Hübelbauer
Clean up document and try last attempt with WSL
2017-12-31 08:00:32 Tomáš Hübelbauer
Replace fully with PowerShell
2017-12-31 07:47:13 Tomáš Hübelbauer
Use uname -a to distinguish
2017-12-31 07:46:12 Tomáš Hübelbauer
Distringuish scripts
2017-12-31 07:45:38 Tomáš Hübelbauer
Add whoami to distinguish systems
2017-12-31 07:43:35 Tomáš Hübelbauer
Try a hack to pick up Windows Bash
2017-12-31 07:38:42 Tomáš Hübelbauer
Invoke 7z to see if it exists
2017-12-31 07:38:17 Tomáš Hübelbauer
Restructure doc and use Bash shebang
2017-12-31 07:21:20 Tomáš Hübelbauer
Move script to file from inline
2017-12-31 07:19:55 Tomáš Hübelbauer
Figure out hook working directory
2017-12-31 07:19:09 Tomáš Hübelbauer
Document all approaches to MinGW
2017-12-31 07:17:51 Tomáš Hübelbauer
Try bash instead of ubuntu to see about permissions
2017-12-31 07:13:45 Tomáš Hübelbauer
Switch to PowerShell
2017-12-31 06:58:23 Tomáš Hübelbauer
Add a shebang to fix the hook not running https://stackoverflow.com/a/5697993/2715716
2017-12-31 06:49:57 Tomáš Hübelbauer
Create README.md