How do I add big HTTP files in a Dockerfile and exclude them from image layers?

Our Nexus server provides build artifacts for our Java project including its installer. That installer is really big (>1GB). I would like to retrieve and use it in a Dockerfile.

What I did so far is the following:

    FROM debian:jessie
    ...
    RUN apt-get install -y curl libxml-xpath-perl
    ENV PROJECT_VERSION x.y.z-SNAPSHOT
    ...
    RUN VERSION=`curl --silent "http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64" | xpath -q -s '' -e '//data/version/text()'` \
        && echo Version:\'${VERSION}\' \
        && curl --silent http://nexus/content/groups/public/my/group/id/installer/${PROJECT_VERSION}/installer-${VERSION}-linux64.sh \
            --create-dirs \
            --output ${INSTALL_DIR}/installer.sh \
        && sh ${INSTALL_DIR}/installer.sh <someArgs> \
        && rm ${INSTALL_DIR}/installer.sh
    ...
    

    With that approach I am able to:

    • Query Nexus for the latest SNAPSHOT version matching the provided ${PROJECT_VERSION}, which is logged during docker build
    • Use that version to download the corresponding installer binary
    • Execute the installer binary
    • Delete the installer binary immediately after execution so that it is not stored within the created Docker image layer

    What is missing:

    • Whenever a new installer gets deployed to Nexus, I have to build the Docker image with docker build --no-cache. Otherwise Docker reuses the cached layer, because the text of the RUN instruction has not changed, and does not re-run the installation step for the newer installer that was deployed to Nexus in the meantime.

    So I tried a different approach using the ADD instruction, which has caching capabilities according to the documentation. But that does not work, since I need to pass a parameter to the ADD instruction that is set by a previous step querying Nexus for the correct SNAPSHOT version:

    FROM debian:jessie
    ...
    RUN apt-get install -y curl libxml-xpath-perl
    ENV PROJECT_VERSION x.y.z-SNAPSHOT
    ...
    ADD http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/version.xml
    RUN cat ${INSTALL_DIR}/version.xml | xpath -q -s '' -e '//data/version/text()' > ${INSTALL_DIR}/version.txt
    
    # FIXME: Somehow do a `cat ${INSTALL_DIR}/version.txt` to set the ENV ${VERSION} variable?!
    
    ADD http://nexus/content/groups/public/my/group/id/installer/${PROJECT_VERSION}/installer-${VERSION}-linux64.sh ${INSTALL_DIR}/installer.sh
    RUN sh ${INSTALL_DIR}/installer.sh <someArgs> && rm ${INSTALL_DIR}/installer.sh
    ...
    

    That approach does not work because:

    • It is not possible to set the ${VERSION} environment variable within the Dockerfile to the version stored within the version.txt file.
    • It is not possible to prevent having the installer stored within an image layer.

    But at least this would use proper caching to re-use existing image layers for old installer versions and create new ones whenever a new installer version on Nexus gets deployed.
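
    On the first point: ENV cannot take its value from a command, but a shell variable inside a single RUN instruction can. The sketch below shows the shell part only, assuming version.txt contains nothing but the resolved version string. The function name installer_url is hypothetical and not part of the original Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper: builds the installer URL from a version file,
# the way the body of a single RUN instruction could.
# $1 = project version (e.g. x.y.z-SNAPSHOT)
# $2 = path to version.txt (assumed to hold only the resolved version)
installer_url() {
    # Strip the trailing newline and any stray whitespace from the file.
    version=$(tr -d '[:space:]' < "$2")
    printf 'http://nexus/content/groups/public/my/group/id/installer/%s/installer-%s-linux64.sh\n' "$1" "$version"
}
```

    Because the variable only lives inside that one shell invocation, the download, the installation and the rm would have to happen in the same RUN instruction that reads the file.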

    So the question is: How do I enable proper caching, cache invalidation and exclusion of the big installer file from the Docker image layers at the same time?

    EDIT: I found a way to get the caching of image layers working properly by using another Nexus API:

    FROM debian:jessie
    ...
    ENV PROJECT_VERSION x.y.z-SNAPSHOT
    ...
    ADD http://nexus:8081/service/local/artifact/maven/content?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/installer.sh
    RUN sh ${INSTALL_DIR}/installer.sh <someArgs> \
        && rm ${INSTALL_DIR}/installer.sh
    ...
    

    But the problem of having a very big installer file included in the image layers remains, since that code snippet uses the ADD mechanism.

    Any ideas about how to benefit from the caching and correct cache invalidation provided by the ADD instruction without including the added file in the image's history?

2 solutions collected from the web for “How do I add big HTTP files in a Dockerfile and exclude them from image layers?”

    How about doing curl/wget, install and remove in one long run command?

    Update: this works in combination with an ADD of a smaller resource; see TC’s detailed answer.

    I accepted Mykola Gurov's answer because in one of his comments he pointed out an idea that helped me to solve this issue.

    Here is what I did to have proper caching and cache invalidation as well as having the big installer file excluded:

    FROM debian:jessie
    ...
    RUN apt-get install -y curl
    ENV PROJECT_VERSION x.y.z-SNAPSHOT
    ...
    ADD http://nexus:8081/service/local/artifact/maven/resolve?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64 ${INSTALL_DIR}/installer.xml
    RUN curl --silent "http://nexus:8081/service/local/artifact/maven/content?r=public&g=my.group.id&a=installer&v=${PROJECT_VERSION}&e=sh&c=linux64" \
            --output ${INSTALL_DIR}/installer.sh \
        && sh ${INSTALL_DIR}/installer.sh <someArgs> \
        && rm ${INSTALL_DIR}/installer.sh
    ...
    

    The first ADD downloads the Maven metadata for the requested artifact. That XML file is quite small, and ADD caches it properly: whenever the metadata on Nexus has been modified, the cache gets invalidated.

    In that case the ADD and all following instructions are executed without re-using any cached layers.

    If the metadata on the server has not changed since the last download, the ADD and the following RUN instruction, which executes curl, are taken from the image layer cache. Within that RUN it is possible to download, execute and remove the big temporary installer file in one step, so it is never stored in any image layer.
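
    The reasoning above amounts to using a small file as the cache key for an expensive step. The sketch below mimics that decision in plain shell, outside of Docker. It only illustrates the idea and is not how Docker implements its layer cache; the function name metadata_changed and the checksum file are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decides whether the expensive download-and-install
# step has to run again, using a small metadata file as the cache key.
# $1 = freshly downloaded metadata file (e.g. installer.xml)
# $2 = file holding the checksum seen on the previous run
metadata_changed() {
    new_sum=$(cksum < "$1")
    # A missing checksum file means this is the first run.
    old_sum=$(cat "$2" 2>/dev/null || true)
    if [ "$new_sum" = "$old_sum" ]; then
        return 1    # unchanged: the cached result can be reused
    fi
    printf '%s\n' "$new_sum" > "$2"
    return 0        # changed or first run: repeat the expensive step
}
```

    Only the small metadata file is kept and compared; the big installer never needs to be stored for the cache decision to work.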
