Organization is the act of reducing friction between the continuous generation of actionable ideas and their implementation.
I've wasted too much time organizing my stuff.
Identify what you do
For example, I am a web/mobile app developer and owner of my own business. So I need a way to easily organize my files, create new projects, keep them synchronized across devices, and write and update content for my blog.
Pondering the options
When I create a project on my local machine, I need to be able to push it to the outside world. Linking it to a publicly accessible domain name requires a fixed IP address. Since I don't have one locally, I could call the ISP and ask what a dedicated IP address costs. But I may move to a different city, and transferring all those domains to a new IP address seems cumbersome. Moving the whole process to a company which offers a dedicated IP address looks like the natural best option. The cost evaluation at the ISP (Swisscom):
- Option: 1 fixed IP address, 10 CHF/month
- Option: 4 fixed IP addresses, 20 CHF/month
- Option: 8 fixed IP addresses, 30 CHF/month
Using Infomaniak dedicated hosting costs 40 CHF per month with a dedicated IP address as well as 100 GB of storage. This has the added value of keeping the same configuration whether you move or not.
Breaking up requirements
I want:
- Synchronization of my personal files across devices
  - allows downloading the latest version of any file to my machine
  - I don't care about the directory structure so much, as long as I know that a file I change on one computer is changed across devices; decoupling it from the folder structure (and possibly the name) must not detach it from its own history
  - every time I create a new file, I want it backed up in the cloud as soon as there is internet connectivity
- Compatibility with my coding projects tracked by git
  - code projects do require a file structure and should be backed up with that structure
- The ability to write/modify content and push it to a publicly accessible "blog" in a frictionless way
  - should allow specifying the public endpoint: "My Personal Blog", "My Other Blog", "My Business Blog", etc.
- A way to publish projects live
  - some hosts provide easy-to-install environments (WordPress, Node.js, etc.); being able to use those is a plus
Solutions
1.2 requires some metadata to be stored along with the file, for example a UUID. But that metadata cannot be stored in the file itself, since it would change the raw file data. So the metadata should be decoupled from the file data and stored in a local database containing the path, MD5Hash, CloudUUID and DateModified.
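A minimal sketch of what that local database could look like, assuming sqlite3 is used; the table and column names are illustrative, not an actual Stackive schema:
sqlite3 stackive.db <<'SQL'
CREATE TABLE IF NOT EXISTS tracked_files (
  path          TEXT PRIMARY KEY,   -- where the file lives on this machine
  md5_hash      TEXT NOT NULL,      -- MD5 of the content at the last sync
  cloud_uuid    TEXT,               -- identity of the file in the cloud
  date_modified INTEGER             -- last local modification time (unix epoch)
);
SQL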
With this database in place, whenever a file is downloaded from the cloud, we need to make sure that there is no other version of the same file on the local machine:
- The file can already exist on the machine but has not been labeled as trackable, therefore the local database does not know about its existence.
  - if that untracked local file, with say version L, is ever switched to trackable, we may have a duplicate locally:
    - when they have the same md5, the program will need to decide whether to keep both locally, letting both evolve as separate versions (i.e. different files), or to symlink one to the other. So ask the user: "Do you want to keep these two files always equal, or are they going to become different over time?" (see the sketch after this list)
    - when they already have a different md5, the local file L will be uploaded as a different file.
      - eventually, with the help of the search engine, it will appear very close to the other one in the search results, and the user can decide to delete one of the versions if they want a single version alive.
      - in that latter case, say the user deletes the C version: a different machine containing the C version will have to delete it locally, and the user will have to download the L version manually or automatically, depending on the synchronization greediness.
- The file's cloudUUID already exists on the machine:
  - the local file L has a different md5 than the cloud md5, meaning these are two versions of the same file: the user should decide which one to set as the latest version (either use the modified date as the default indicator, or decide manually).
  - the local file is the same as the cloud file: tell the user where the local file is and ask them whether to download a copy, symlink it, or "show in finder" (see 1.1.1.1).
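The sketch referenced in the list above, for the local-duplicate decision; local_file is assumed to point to the untracked file L, cloud_copy to the already-tracked copy, GNU md5sum is assumed to be available, and the actual upload is left as a stub:
local_md5=$(md5sum "$local_file" | cut -d' ' -f1)
cloud_md5=$(md5sum "$cloud_copy" | cut -d' ' -f1)
if [ "$local_md5" = "$cloud_md5" ]; then
  # identical content: let the user pick one linked file or two diverging ones
  read -r -p "Keep these two files always equal (s = symlink) or let them diverge (d)? " choice
  [ "$choice" = "s" ] && ln -sf "$cloud_copy" "$local_file"
else
  # different content: treat L as a brand new file with its own CloudUUID
  echo "uploading $local_file as a new file"   # stub for the real upload call
fi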
When we talk about the localDB, it does not necessarily need to be local. It could be stored on the server, using the user-agent to map the paths to the machine in question. But there is no need: keeping it local is much more cost-effective and fast.
Coming back to the download case: check whether we have the file's UUID in the local DB. If the CloudUUID is in the DB, then we need to make sure to sync both local and cloud to the latest version (the cloud will be updated on the next scheduled backup, so we actually only care about local). We compare the localDateModified with the cloudDateModified. Of course, the date modified does not tell us whether the modified contents started off from the latest version or from an unsynced old version. So we should use something like git to track file changes, but this imposes the structure on every device. Using FSEvents to listen for changes to files could be an alternative solution: we then compare the previousMD5Hash with the current one to tell whether there was a change.
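A sketch of that change-listening approach, assuming the third-party fswatch CLI (which wraps FSEvents on macOS and inotify on Linux) and the illustrative tracked_files table from above:
fswatch ~/Documents | while read -r changed; do
  [ -f "$changed" ] || continue          # ignore deletions and directory events
  new_md5=$(md5sum "$changed" | cut -d' ' -f1)
  old_md5=$(sqlite3 stackive.db \
    "SELECT md5_hash FROM tracked_files WHERE path = '$changed';")
  if [ "$new_md5" != "$old_md5" ]; then
    echo "content changed: $changed -> queue for upload"
  fi
done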
IMPORTANT: when a file is uploaded to the cloud, Stackive checks whether a data file with that MD5 already exists; if there is not, it stores the data file under a new UUID, and if there is, it reuses the existing UUID. This was made to avoid redundancy. But with the goal of uploading new versions of the same file, this will generate a lot of redundancy, since every new version of the file will be stored as an entire file under a new UUID. Using git's diff feature and storing the entire file history, instead of a new file for every snapshot, would be less redundant. To do so we would need to track the cloud file structure with git: every time a file is being uploaded, we check the latest reference (commit ref) that the local machine has, find the UUID of that ref in the cloud (it must be the latest in the file history), replace the file in the cloud fs with the new version uploaded from the local machine, and add it to the git history.
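A rough sketch of that git-backed cloud store, purely illustrative: CLOUD_FS is assumed to be the git-tracked file tree on the server, UUID the file's identifier inside it, and NEW_VERSION the freshly uploaded content:
cd "$CLOUD_FS"
git rev-parse HEAD                   # latest ref; must match the ref the client started from
cp "$NEW_VERSION" "files/$UUID"      # replace the file's data in the cloud fs
git add "files/$UUID"
git commit -m "new version of $UUID" # append to the file's history instead of storing a second standalone copy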
IMPORTANT: a distinction has to be made between the initial purpose of Stackive (i.e. being the bridge between printed and digital life) and managing a whole file system, which is a different story, since versioning is only interesting for digital files that get modified over time.
Git tracking footprint
Before initializing an empty repository, the size of a fresh directory is:
mkdir testDir
cd testDir
du -bsh . # 4K
Then we add a file with some text to make it 200 bytes:
vim hello.txt
# add some text, then
ls -la hello.txt # -rw-r--r-- 1 g g 200 May 3 16:10 hello.txt
Now make sure our numbers are still correct:
du -bsh . # 4.2K - Good
Let's init an empty git repository:
git init # Initialized empty Git repository in /home/g/testDir
du -bsh # 64 K
Now we have the empty git repository footprint: ~60K.
If we now start tracking our 200B hello.txt file, let's see how much the directory grows:
git add .
git commit -m "feat: add 200B file to history"
du -bsh # 92K
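Out of curiosity, we can also ask git what it actually created for this single commit (standard git commands; the exact numbers will vary):
git count-objects -vH                 # loose objects and their size on disk
find .git/objects -type f | wc -l     # raw object files under .git/objects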
The thing is that tracking the history of a file with git will roughly double the storage size requirements for big files, and more than that for small files (when tracking a single point in time). This is because git will generate many objects inside the .git directory once we run git add . on the directory. For example:
- tracking a 767MB file will increase the size of the containing git repository to 1.5GB
- tracking a 0.2K file will increase the size of the containing git repository to 92K
All in all, the storage increase after deducting git's generic size footprint is:
- 2 x 767M = 1.5G
- 160 x 0.2K = 32K
Changing 50% of the 0.2K file results in an increase in directory size after commit of 13K, which is a factor-of-65 increase. Anyhow, tracking file history is very space-consuming.
IMPORTANT: it seems that an intermediate solution of storing the latest version and the previous version is good enough for most files. But then, how do we link different versions of the same file together?
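One possible direction, sketched only as an illustration (not a decision, and reusing the hypothetical sqlite schema from above): give each stored version a pointer to the version it replaced, so the latest and previous versions stay linked:
sqlite3 stackive.db <<'SQL'
CREATE TABLE IF NOT EXISTS file_versions (
  cloud_uuid    TEXT PRIMARY KEY,                           -- this snapshot of the file
  previous_uuid TEXT REFERENCES file_versions(cloud_uuid),  -- the snapshot it replaced, NULL for the first one
  md5_hash      TEXT NOT NULL,
  date_modified INTEGER
);
SQL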