In many of my cases, either the source code production on the review computer comprises one or more archive files (e.g., ZIP files) or the production is a folder tree that contains one or more archive files.
Some archive files, when unarchived, expand to a folder tree that contains more archive files, and some of those archive files, when unarchived, expand to a folder tree that contains even more archive files, and so on.
Therefore, in order to expose all the files in the production, there are two important steps to take:
- Identify all archive files
- Recursively unarchive each identified archive file
The first step might seem easy: find all the files with a “.zip” file extension. However, “.zip” is far from the only extension to look for: many other file types also contain one or more files and folders.
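As a minimal sketch of why an extension alone is not a reliable signal, a file's leading bytes can be compared against well-known format signatures. The names `SIGNATURES`, `sniff_archive_type`, and `sniff_file` are hypothetical, the table is far from exhaustive, and this is not how the script later in this section identifies archives (it relies on the extensions that 7-Zip reports).

```python
from typing import Optional

# Leading "magic" bytes of a few common archive formats (illustrative, not exhaustive).
SIGNATURES = {
    b'PK\x03\x04': 'zip',          # also .docx, .xlsx, .pptx, .jar, .apk
    b'7z\xbc\xaf\x27\x1c': '7z',
    b'Rar!\x1a\x07': 'rar',
    b'\x1f\x8b': 'gzip',
    b'BZh': 'bzip2',
}


def sniff_archive_type(header: bytes) -> Optional[str]:
    """Return a format name if the header starts with a known signature, else None."""
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read the first few bytes of a file and classify them."""
    with open(path, 'rb') as handle:
        return sniff_archive_type(handle.read(8))
```

A check like this would flag a ZIP file that has been renamed to, say, “notes.txt”, which a search by extension would miss.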
In my experience, the 7-Zip tool unarchives the largest number of archive file types: more than 100. For example, even though the following file extensions might be thought to identify opaque binary file types, they are actually archives which contain other files and folders:
- .docx is a modern Microsoft Word file that is an archive of files which comprise the content of the document and metadata about the document. Below is an example of the folder tree created after unarchiving a file called Doc.docx, where the contents of the Microsoft Word document are in the file “Doc\word\document.xml”, and the contents of a Microsoft Excel document embedded within the Microsoft Word file are in the file “Doc\word\embeddings\Worksheet.xlsx”:
Doc
├── [Content_Types].xml
├── _rels
├── docProps
│   ├── app.xml
│   └── core.xml
└── word
    ├── _rels
    │   └── document.xml.rels
    ├── document.xml
    ├── embeddings
    │   └── Worksheet.xlsx
    ├── fontTable.xml
    ├── settings.xml
    ├── styles.xml
    ├── theme
    │   └── theme1.xml
    └── webSettings.xml
- .pptx is a modern Microsoft PowerPoint file that is an archive of files which comprise the content of the slides and metadata about the slides.
- .xlsx is a modern Microsoft Excel file that is an archive of files which comprise the content of the spreadsheet and metadata about the spreadsheet.
- .jar is a Java ARchive file that is an archive of Java class files.
- .apk is an Android Package Kit file that is an archive of files used to implement an Android app.
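To illustrate the point about the file types listed above, the following sketch uses Python's standard zipfile module to treat a .docx-style file as an ordinary ZIP archive. It does not open a real Word document: it builds a tiny in-memory stand-in whose member names are borrowed from the Doc.docx tree shown earlier, and the member contents are placeholders that I made up.

```python
import io
import zipfile

# Build a tiny in-memory stand-in for Doc.docx (placeholder contents, not a valid Word file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('[Content_Types].xml', '<Types/>')
    archive.writestr('word/document.xml', '<document/>')
    archive.writestr('word/embeddings/Worksheet.xlsx', b'')

# The same calls work on a path such as 'Doc.docx' instead of a BytesIO object.
buffer.seek(0)
is_zip = zipfile.is_zipfile(buffer)
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()

print(is_zip)
for member_name in member_names:
    print(member_name)
```

Note that the embedded Worksheet.xlsx member is itself an archive in a real document, which is exactly the nesting that makes recursive unarchiving necessary.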
The most popular archive file type that 7-Zip does not unarchive unless 7-Zip plugins are installed is the Roshal ARchive file type (i.e., .rar).
The following is a script which uses 7-Zip (for almost all archives) and unar (for .rar archives) to recursively unarchive the files in a production. The documentation for the script is contained in the comments in the script itself.
#!/usr/bin/env python3
"""
Copyright 2020-2021 Stairstep Consulting LLC. All rights reserved.
Creative Commons Attribution-ShareAlike 4.0 International Public
License
By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.
Section 1 -- Definitions.
a. Adapted Material means material subject to Copyright and Similar
Rights that is derived from or based upon the Licensed Material
and in which the Licensed Material is translated, altered,
arranged, transformed, or otherwise modified in a manner requiring
permission under the Copyright and Similar Rights held by the
Licensor. For purposes of this Public License, where the Licensed
Material is a musical work, performance, or sound recording,
Adapted Material is always produced where the Licensed Material is
synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright
and Similar Rights in Your contributions to Adapted Material in
accordance with the terms and conditions of this Public License.
c. BY-SA Compatible License means a license listed at
creativecommons.org/compatiblelicenses, approved by Creative
Commons as essentially the equivalent of this Public License.
d. Copyright and Similar Rights means copyright and/or similar rights
closely related to copyright including, without limitation,
performance, broadcast, sound recording, and Sui Generis Database
Rights, without regard to how the rights are labeled or
categorized. For purposes of this Public License, the rights
specified in Section 2(b)(1)-(2) are not Copyright and Similar
Rights.
e. Effective Technological Measures means those measures that, in the
absence of proper authority, may not be circumvented under laws
fulfilling obligations under Article 11 of the WIPO Copyright
Treaty adopted on December 20, 1996, and/or similar international
agreements.
f. Exceptions and Limitations means fair use, fair dealing, and/or
any other exception or limitation to Copyright and Similar Rights
that applies to Your use of the Licensed Material.
g. License Elements means the license attributes listed in the name
of a Creative Commons Public License. The License Elements of this
Public License are Attribution and ShareAlike.
h. Licensed Material means the artistic or literary work, database,
or other material to which the Licensor applied this Public
License.
i. Licensed Rights means the rights granted to You subject to the
terms and conditions of this Public License, which are limited to
all Copyright and Similar Rights that apply to Your use of the
Licensed Material and that the Licensor has authority to license.
j. Licensor means the individual(s) or entity(ies) granting rights
under this Public License.
k. Share means to provide material to the public by any means or
process that requires permission under the Licensed Rights, such
as reproduction, public display, public performance, distribution,
dissemination, communication, or importation, and to make material
available to the public including in ways that members of the
public may access the material from a place and at a time
individually chosen by them.
l. Sui Generis Database Rights means rights other than copyright
resulting from Directive 96/9/EC of the European Parliament and of
the Council of 11 March 1996 on the legal protection of databases,
as amended and/or succeeded, as well as other essentially
equivalent rights anywhere in the world.
m. You means the individual or entity exercising the Licensed Rights
under this Public License. Your has a corresponding meaning.
Section 2 -- Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License,
the Licensor hereby grants You a worldwide, royalty-free,
non-sublicensable, non-exclusive, irrevocable license to
exercise the Licensed Rights in the Licensed Material to:
a. reproduce and Share the Licensed Material, in whole or
in part; and
b. produce, reproduce, and Share Adapted Material.
2. Exceptions and Limitations. For the avoidance of doubt, where
Exceptions and Limitations apply to Your use, this Public
License does not apply, and You do not need to comply with
its terms and conditions.
3. Term. The term of this Public License is specified in Section
6(a).
4. Media and formats; technical modifications allowed. The
Licensor authorizes You to exercise the Licensed Rights in
all media and formats whether now known or hereafter created,
and to make technical modifications necessary to do so. The
Licensor waives and/or agrees not to assert any right or
authority to forbid You from making technical modifications
necessary to exercise the Licensed Rights, including
technical modifications necessary to circumvent Effective
Technological Measures. For purposes of this Public License,
simply making modifications authorized by this Section 2(a)
(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor -- Licensed Material. Every
recipient of the Licensed Material automatically
receives an offer from the Licensor to exercise the
Licensed Rights under the terms and conditions of this
Public License.
b. Additional offer from the Licensor -- Adapted Material.
Every recipient of Adapted Material from You
automatically receives an offer from the Licensor to
exercise the Licensed Rights in the Adapted Material
under the conditions of the Adapter's License You apply.
c. No downstream restrictions. You may not offer or impose
any additional or different terms or conditions on, or
apply any Effective Technological Measures to, the
Licensed Material if doing so restricts exercise of the
Licensed Rights by any recipient of the Licensed
Material.
6. No endorsement. Nothing in this Public License constitutes or
may be construed as permission to assert or imply that You
are, or that Your use of the Licensed Material is, connected
with, or sponsored, endorsed, or granted official status by,
the Licensor or others designated to receive attribution as
provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not
licensed under this Public License, nor are publicity,
privacy, and/or other similar personality rights; however, to
the extent possible, the Licensor waives and/or agrees not to
assert any such rights held by the Licensor to the limited
extent necessary to allow You to exercise the Licensed
Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this
Public License.
3. To the extent possible, the Licensor waives any right to
collect royalties from You for the exercise of the Licensed
Rights, whether directly or through a collecting society
under any voluntary or waivable statutory or compulsory
licensing scheme. In all other cases the Licensor expressly
reserves any right to collect such royalties.
Section 3 -- License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the
following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified
form), You must:
a. retain the following if it is supplied by the Licensor
with the Licensed Material:
i. identification of the creator(s) of the Licensed
Material and any others designated to receive
attribution, in any reasonable manner requested by
the Licensor (including by pseudonym if
designated);
ii. a copyright notice;
iii. a notice that refers to this Public License;
iv. a notice that refers to the disclaimer of
warranties;
v. a URI or hyperlink to the Licensed Material to the
extent reasonably practicable;
b. indicate if You modified the Licensed Material and
retain an indication of any previous modifications; and
c. indicate the Licensed Material is licensed under this
Public License, and include the text of, or the URI or
hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any
reasonable manner based on the medium, means, and context in
which You Share the Licensed Material. For example, it may be
reasonable to satisfy the conditions by providing a URI or
hyperlink to a resource that includes the required
information.
3. If requested by the Licensor, You must remove any of the
information required by Section 3(a)(1)(A) to the extent
reasonably practicable.
b. ShareAlike.
In addition to the conditions in Section 3(a), if You Share
Adapted Material You produce, the following conditions also apply.
1. The Adapter's License You apply must be a Creative Commons
license with the same License Elements, this version or
later, or a BY-SA Compatible License.
2. You must include the text of, or the URI or hyperlink to, the
Adapter's License You apply. You may satisfy this condition
in any reasonable manner based on the medium, means, and
context in which You Share Adapted Material.
3. You may not offer or impose any additional or different terms
or conditions on, or apply any Effective Technological
Measures to, Adapted Material that restrict exercise of the
rights granted under the Adapter's License You apply.
Section 4 -- Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right
to extract, reuse, reproduce, and Share all or a substantial
portion of the contents of the database;
b. if You include all or a substantial portion of the database
contents in a database in which You have Sui Generis Database
Rights, then the database in which You have Sui Generis Database
Rights (but not its individual contents) is Adapted Material,
including for purposes of Section 3(b); and
c. You must comply with the conditions in Section 3(a) if You Share
all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.
Section 5 -- Disclaimer of Warranties and Limitation of Liability.
a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
c. The disclaimer of warranties and limitation of liability provided
above shall be interpreted in a manner that, to the extent
possible, most closely approximates an absolute disclaimer and
waiver of all liability.
Section 6 -- Term and Termination.
a. This Public License applies for the term of the Copyright and
Similar Rights licensed here. However, if You fail to comply with
this Public License, then Your rights under this Public License
terminate automatically.
b. Where Your right to use the Licensed Material has terminated under
Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided
it is cured within 30 days of Your discovery of the
violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any
right the Licensor may have to seek remedies for Your violations
of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the
Licensed Material under separate terms or conditions or stop
distributing the Licensed Material at any time; however, doing so
will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
License.
Section 7 -- Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different
terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the
Licensed Material not stated herein are separate from and
independent of the terms and conditions of this Public License.
Section 8 -- Interpretation.
a. For the avoidance of doubt, this Public License does not, and
shall not be interpreted to, reduce, limit, restrict, or impose
conditions on any use of the Licensed Material that could lawfully
be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is
deemed unenforceable, it shall be automatically reformed to the
minimum extent necessary to make it enforceable. If the provision
cannot be reformed, it shall be severed from this Public License
without affecting the enforceability of the remaining terms and
conditions.
c. No term or condition of this Public License will be waived and no
failure to comply consented to unless expressly agreed to by the
Licensor.
d. Nothing in this Public License constitutes or may be interpreted
as a limitation upon, or waiver of, any privileges and immunities
that apply to the Licensor or You, including from the legal
processes of any jurisdiction or authority.
"""
import sys
import os
import subprocess
# file types to not attempt to unarchive because their unarchiving fails, at least for your production
fail_extensions = []
# file names to not attempt to unarchive because their unarchiving fails, at least for your production
fail_files = []
# file types to not attempt to unarchive because you don't need to see their unarchived contents
unnecessary_extensions = []
# file names to not attempt to unarchive because you don't need to see their unarchived contents
unnecessary_files = []
# If True, extensions to be unarchived will be parsed from output of "7z i"
do_parse_include_extensions = True
# file types to unarchive even if "7z i" output is not parsed or even if these file types are not output by "7z i"
force_extensions = []
# file names to unarchive even if "7z i" output is not parsed or even if these files' types are not output by "7z i"
force_files = []
fail_extensions = [x.lower() for x in fail_extensions]
fail_files = [x.lower() for x in fail_files]
unnecessary_extensions = [x.lower() for x in unnecessary_extensions]
unnecessary_files = [x.lower() for x in unnecessary_files]
force_extensions = [x.lower() for x in force_extensions]
force_files = [x.lower() for x in force_files]
exclude_extensions = set(fail_extensions + unnecessary_extensions)
exclude_files = set(fail_files + unnecessary_files)
include_extensions = set(force_extensions)
include_files = set(force_files)
# temporary suffix added to end of each archive file for which unarchiving has been attempted to avoid a second attempt
attempted_unarchiving_file_suffix = '-attempted_unarchiving'
# suffix added to the end of the folder created to hold the contents of an unarchived archive file
unarchived_folder_suffix = '-unarchived'
def parse_include_extensions() -> None:
    """
    Parses the output of the "7z i" command to determine what extensions 7z says it can process.
    At the end of this file is an example of the "7z i" output.
    Adds to the set in the include_extensions global variable.
    :return: None
    """
    if do_parse_include_extensions:
        start_of_formats_section = False
        for line in os.popen('7z i').readlines():
            line = line.rstrip()
            # Parses the "7z i" section between the line that begins with "Formats:" and the next blank line
            if start_of_formats_section:
                if not line:
                    # End of "Formats:" section
                    break
                # Ignore characters before start of the blank-separated extensions and create list of tokens that follow
                tokens = line[26:].split(' ')
                # Not all tokens at the end of the list might be extensions, so prune non-extensions from right-to-left
                add_remaining_tokens_as_extensions = False
                for index, token in enumerate(reversed(tokens)):
                    if len(token) > 2 or index == len(tokens) - 1:
                        if not token.startswith('offset=') and not token.startswith('\\x') and not token == '(~.swf)':
                            add_remaining_tokens_as_extensions = True
                    # Some extensions are specified as just, for example, "abc" and some are specified as "(.abc)";
                    # normalize each to ".abc"
                    if add_remaining_tokens_as_extensions:
                        if token.startswith('(.') and token.endswith(')'):
                            token = token[1:-1]
                        else:
                            token = '.' + token
                        include_extensions.add(token)
            if line.startswith('Formats:'):
                start_of_formats_section = True
    if include_extensions or include_files:
        formatted_list = ' '.join([('*' + x) for x in sorted(include_extensions)] + sorted(include_files))
        print(f'\nWill attempt to unarchive these files: {formatted_list}')
    else:
        print('Will attempt to unarchive all files')
    if exclude_extensions or exclude_files:
        formatted_list = ' '.join([('*' + x) for x in sorted(exclude_extensions)] + sorted(exclude_files))
        print(f'\n...except will not attempt to unarchive these files: {formatted_list}')
def initial_linking(source_path: str, target_root: str) -> None:
    """
    Clones the files and folders in the source_path to the target_root folder, using hard links if available
    :param source_path: Path of file that might be an archive or of folder that might contain one or more archives
    :param target_root: Path of folder that will contain the recursively unarchived folder tree
    :return: None
    """
    print('\nLinking...')
    if os.path.isdir(source_path):
        source_path = source_path if not source_path.endswith(os.path.sep) else source_path[:-1]
        for source_folder, source_sub_folders, source_files in os.walk(source_path):
            for source_file in source_files:
                process_file(source_path, source_folder[len(source_path)+1:], source_file, target_root)
            for source_sub_folder in source_sub_folders:
                process_folder(source_path, source_folder[len(source_path)+1:], source_sub_folder, target_root)
    else:
        process_file(os.path.dirname(source_path), '', os.path.basename(source_path), target_root)


def process_file(source_root: str, source_rel: str, source_file_name: str, target_root: str) -> None:
    # Hard-link one source file into the matching location under the target root
    target_path = os.path.join(target_root, source_rel)
    os.makedirs(target_path, exist_ok=True)
    os.link(os.path.join(source_root, source_rel, source_file_name), os.path.join(target_path, source_file_name))


def process_folder(__source_root: str, source_rel: str, source_folder_name: str, target_root: str) -> None:
    # Recreate one source folder (even if it is empty) under the target root
    target_path = os.path.join(target_root, source_rel, source_folder_name)
    os.makedirs(target_path, exist_ok=True)
def unarchive_recursively_passes(target_root: str, passwords: [str]) -> None:
    """
    Pass through the entire production unarchiving archive files. If any archive files were unarchived in a pass, do
    another pass since the prior unarchiving pass could have expanded to a folder tree that contains more archive files.
    Stop when a pass finds no archive files.
    :param target_root: Target folder
    :param passwords: List of passwords to try for each archive.
    :return: None
    """
    perform_another_pass = True
    pass_number = 1
    while perform_another_pass:
        print(f'\nUnarchiving... (pass {pass_number})')
        perform_another_pass = unarchive_pass(target_root, passwords)
        pass_number += 1
    # The final pass found no archive files, so report that under its heading
    print('None')
def unarchive_pass(target_root: str, passwords: [str]) -> bool:
    """
    Looks for archive files using the following precedence:
        ignore archive files for which unarchiving has already been attempted.
        ignore files that match either the file extension or file name exclusion lists.
        if either a file extension or file name inclusion list is specified, attempt to unarchive each file that matches
            either the file extension or file name inclusion list.
        if neither a file extension nor file name inclusion list is specified, attempt to unarchive all files (yes, even
            files like "*.txt" files, because some files' extensions hide the fact that they are actually unarchivable).
    :param target_root: path where the results go
    :param passwords: list of passwords to try for each unarchiving attempt
    :return: False, if no unarchiving was attempted on this pass; True, if at least one unarchiving was attempted on this
        pass, implying that an unarchiving in this pass might have exposed additional archive files to be processed in the
        next pass.
    """
    perform_another_pass = False
    for folder, __, files in os.walk(target_root):
        for file in files:
            if not file.endswith(attempted_unarchiving_file_suffix):
                _base, extension = os.path.splitext(file)
                extension = extension.lower()
                # The inclusion and exclusion lists were lowercased above, so compare against the lowercased file name
                file_lower = file.lower()
                excluded = extension in exclude_extensions or file_lower in exclude_files
                included = (
                    (not include_extensions and not include_files)
                    or extension in include_extensions
                    or file_lower in include_files
                )
                if included and not excluded:
                    archive_path = os.path.join(folder, file)
                    print(f"{archive_path[len(target_root) + len('/'):]}")
                    unarchive_file(archive_path, extension, passwords)
                    perform_another_pass = True
    return perform_another_pass
def unarchive_file(archive_path: str, extension: str, passwords: [str]) -> None:
    """
    If a file has already been identified as being an archive (by the caller of this function), then call the external
    command to do the unarchiving. Each password is tried in turn until one succeeds.
    :param archive_path: Path of the archive file to be unarchived
    :param extension: The extension of the archive file
    :param passwords: The list of zero or more passwords to apply to this, and every other, archive
    :return: None
    """
    import shutil  # used only to clean up after a failed attempt

    unarchived_folder_path = archive_path + unarchived_folder_suffix
    if not passwords:
        passwords.append('password')
    last_error = ''
    for password in passwords:
        if extension.lower() == '.rar':
            command = ['unar', '-q', '-p', password, '-o', unarchived_folder_path, archive_path]
        else:
            command = ['7z', 'x', archive_path, '-bso0', '-bsp0', f'-p{password}', f'-o{unarchived_folder_path}']
        result = subprocess.run(command, capture_output=True)
        if result.returncode == 0:
            # Mark the archive as processed so that later passes skip it
            os.rename(archive_path, archive_path + attempted_unarchiving_file_suffix)
            return
        # The attempt failed, possibly because of a wrong password: remember the error, discard any partial output,
        # and try the next password
        last_error = result.stderr.decode('utf-8', errors='replace')
        shutil.rmtree(unarchived_folder_path, ignore_errors=True)
    print(last_error, end='')
    print(f'Could not unarchive file: {archive_path}')
    sys.exit(1)
def remove_attempted_unarchiving_file_suffixes(target_root: str) -> None:
    """
    To avoid unarchiving an archive more than once, an already unarchived archive is given a unique suffix to be
    identified as already having been processed. After all recursive unarchiving passes have completed, this function
    removes those suffixes.
    :param target_root: Target folder
    :return: None
    """
    print('\nRemoving Attempted Unarchiving File Suffixes...')
    paths = []
    for folder, __, files in os.walk(target_root):
        for file in files:
            if file.endswith(attempted_unarchiving_file_suffix):
                paths.append(os.path.join(folder, file))
    for path in sorted(paths, reverse=True):
        os.rename(path, path[:-len(attempted_unarchiving_file_suffix)])
def identify_longest_paths(target_root: str) -> None:
    """
    Prints the longest paths. Necessary because Windows does not allow a path > 260 characters, including "C:\\" and a
    null byte at the end of the path. For example, if "C:\\Review\\" is prepended to every path in the target_root, then
    the maximum path starting from the target_root to the end is 260 - length("C:\\Review\\") - 1 (the null byte) = 249
    :param target_root: The root of the target folder
    :return: None
    """
    print('\nIdentifying Longest Paths without Prefix (> 260 characters with prefix is too long)...')
    longest_path_len = 0
    longest_paths = []
    for folder, _, files in os.walk(target_root):
        for file in files:
            path = os.path.join(folder, file)[len(target_root) + len('/'):]
            path_len = len(path)
            if path_len > longest_path_len:
                longest_path_len = path_len
                longest_paths = [path]
            elif path_len == longest_path_len:
                longest_paths.append(path)
    for longest_path in sorted(longest_paths):
        print(f'{longest_path_len}: {longest_path}')
def unarchive_recursively(source_path: str, target_root: str, passwords: [str]) -> None:
    """
    Recursively unarchives any archive files found in source_path
    :param source_path: Path of the file or folder that might contain one or more archive files
    :param target_root: Path of the folder to contain the recursively unarchived folder tree
    :param passwords: List of zero or more passwords to apply to each archive
    :return: None
    """
    if source_path.endswith('/'):
        source_path = source_path[:-1]
    if target_root.endswith('/'):
        target_root = target_root[:-1]
    if not os.path.exists(source_path):
        print(f'Source Path Does Not Exist: {source_path}')
        sys.exit(1)
    if os.path.exists(target_root):
        print(f'Target Path Exists: {target_root}')
        sys.exit(1)
    parse_include_extensions()
    initial_linking(source_path, target_root)
    unarchive_recursively_passes(target_root, passwords)
    remove_attempted_unarchiving_file_suffixes(target_root)
    identify_longest_paths(target_root)
def _unarchive_recursively_main():
"""
:usage: unarchive_recursively source_file_or_folder target_folder [password...]
source_file_or_folder A file or a folder.
Trivially, if this is a file which is not an archive, the target folder will
contain the source file, and if this is a folder that does not contain any
archive files, the target folder will contain the source folder.
target_folder The recursively unarchived source is put under the target folder.
The target folder is populated with hard-links of all files from the source.
[password...] One or more passwords can be specified. For each archive, each specified
password will be tried until one succeeds. This allows you to unarchive multiple
nested archive files that use different passwords with only one invocation of
this utility. If no password is specified, the default is "password".
Dependencies
If even one archive file in the source is a .rar file, then the unar utility must be in your search path
(for macOS use: brew install unar).
If even one archive file in the source is not a .rar file, then the 7z utility must be in your search path
(for macOS use: brew install p7zip).
Preferences
It may be the case that files which this utility considers to be archive files, you don't feel you need to
unarchive, for whatever reason. You can prevent unarchiving files with specific extensions and/or files with
specific names by listing them in the Python variables "unnecessary_extensions" and "unnecessary_files" at the
top of this script.
By default, this script determines which files are archives based on the extensions output by calling the "7z i"
command. This command lists over 100 extensions. If you want to only unarchive files with a smaller number of
specific extension and/or files with specific names, then set the Python variable "do_parse_include_extensions"
at the top of this file to False and list only the extensions you want to unarchive in the Python variable
"force_extensions" and/or the file names you want to unarchive in the Python variable "force_files", both at the
top of this script. Even if you allow the default so that this script determines which files are archives based
on the extensions output by calling the "7z i" command, you can still list file extensions and file names in the
"force_extensions" and "force_files" variables to archive files that would not otherwise be archived by default.
Troubleshooting
If this utility hangs, it might be waiting for the password of an archive file which you did not specify because
you did not know the archive file requires a password.
It may be the case that files which this utility considers to be archive files are not really archive files,
or are corrupted archive files. You can prevent unarchiving files with specific extensions and/or files with
specific names by listing them in the Python variables "fail_extensions" and "fail_files" at the top of this
script.
Paths that might be too long for Microsoft Windows are identified
By default, the Microsoft Windows operating system does not allow the path of a file to be longer than 260
characters. This limit includes the drive designation (e.g., "C:\\") and includes the null byte at the end of the
path. For example, if all the files in the production are under the folder "C:\\Review\\", then the path
following "C:\\Review\\" can be at most 249 characters long; that is, 260 - length("C:\\Review\\") - 1.
While a folder tree by itself can reach the 260-character path limit, archives embedded in other archives can
quickly reach or exceed that limit as they are recursively unarchived. Therefore, this script
identifies the longest paths.
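The arithmetic above can be sketched as follows. The helper names (iter_tree_paths, longest_paths,
exceeds_windows_limit) are hypothetical and are not taken from this script; the sketch only counts one extra
character for the trailing null byte, as described above.

```python
import os

MAX_WINDOWS_PATH = 260  # includes the drive designation and the trailing null byte


def iter_tree_paths(root):
    # Yield the path of every folder and file under root.
    for folder, subfolders, files in os.walk(root):
        for name in subfolders + files:
            yield os.path.join(folder, name)


def longest_paths(paths, count=10):
    # Return (length including the null byte, path) pairs, longest first.
    return sorted(((len(path) + 1, path) for path in paths), reverse=True)[:count]


def exceeds_windows_limit(path):
    # True if the path plus its trailing null byte does not fit in 260 characters.
    return len(path) + 1 > MAX_WINDOWS_PATH
```

Calling longest_paths(iter_tree_paths(root)) after unarchiving lists the paths most likely to cause trouble if the
tree is later copied to a Windows machine.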
Already unarchived archive files are avoided
It is possible that the same folder in the production might contain both an archive file and a folder containing
the unarchived contents of that archive file. For example, I have seen folders in productions which contain
something like the following:
foo.zip
foo\\
In those situations, the folder named “foo\\” contained the unarchived contents of the archive file named
“foo.zip”. However, the folder named “foo\\” is not guaranteed to contain the unarchived contents of the archive
file named “foo.zip”.
For this reason, when unarchiving a file named “foo.zip”, this script will unarchive its contents into a folder
called “foo.zip-unarchived\\”, with the expectation that there will not already be a folder named
“foo.zip-unarchived\\”. Therefore, in the above example, after this script is run, there will be the following:
foo.zip
foo\\
foo.zip-unarchived\\
If the contents of the folder named “foo\\” are indeed the unarchived contents of the archive file named
“foo.zip”, then the contents of the folder named “foo.zip-unarchived\\” will duplicate the contents of
the folder named “foo\\”.
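The naming convention and the recursion can be illustrated with a small, self-contained sketch. To stay runnable
without 7-Zip, it swaps in Python's zipfile module and therefore handles only ".zip" files; it is not how this
script calls 7z. The names destination_for() and unarchive_zip_recursively() are hypothetical, and skipping an
archive whose "-unarchived" folder already exists is my assumption.

```python
import os
import zipfile

UNARCHIVED_SUFFIX = "-unarchived"


def destination_for(archive_path):
    # "foo.zip" is unarchived into "foo.zip-unarchived", never into "foo".
    return archive_path + UNARCHIVED_SUFFIX


def unarchive_zip_recursively(root):
    # Return the list of destination folders created under root.
    created = []
    pending = [root]
    while pending:
        folder = pending.pop()
        for current, subfolders, files in os.walk(folder):
            for name in files:
                path = os.path.join(current, name)
                destination = destination_for(path)
                if not name.lower().endswith(".zip") or os.path.exists(destination):
                    continue
                try:
                    with zipfile.ZipFile(path) as archive:
                        archive.extractall(destination)
                except zipfile.BadZipFile:
                    continue  # not really an archive, or a corrupted one
                created.append(destination)
                # The new folder may contain more archives; walk it later.
                pending.append(destination)
    return created
```

Because each new folder is queued and walked separately, an archive nested several levels deep ends up in a path
such as "outer.zip-unarchived/inner.zip-unarchived/".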
"""
    if len(sys.argv) >= 4:
        passwords = sys.argv[3:]
    else:
        passwords = []
    unarchive_recursively(sys.argv[1], sys.argv[2], passwords)


if __name__ == '__main__':
    _unarchive_recursively_main()
"""
The following is an example of the output of "7z i" that is parsed by the parse_include_extensions() function...
% 7z i
7-Zip [64] 17.03 : Copyright (c) 1999-2020 Igor Pavlov : 2017-08-28
p7zip Version 17.03 (locale=utf8,Utf16=on,HugeFiles=on,64 bits,10 CPUs x64)
Libs:
0 /usr/local/Cellar/p7zip/17.03/lib/p7zip/7z.dll
Formats:
0 C F 7z 7z 7 z BC AF ' 1C
0 APM apm E R
0 Ar ar a deb lib ! < a r c h > 0A
0 Arj arj ` EA
0 CK bzip2 bz2 bzip2 tbz2 (.tar) tbz (.tar) B Z h
0 F Cab cab M S C F 00 00 00 00
0 Chm chm chi chq chw I T S F 03 00 00 00 ` 00 00 00
0 F Hxs hxs hxi hxr hxq hxw lit I T O L I T L S 01 00 00 00 ( 00 00 00
0 Compound msi msp doc xls ppt D0 CF 11 E0 A1 B1 1A E1
0 M Cpio cpio 0 7 0 7 0 || C7 q || q C7
0 CramFS cramfs offset=16 C o m p r e s s e d 20 R O M F S
0 G B Dmg dmg k o l y 00 00 00 04 00 00 02 00
0 E ELF elf E L F
0 Ext ext ext2 ext3 ext4 img offset=1080 S EF
0 FAT fat img offset=510 U AA
0 FLV flv F L V 01
0 CK gzip gz gzip tgz (.tar) tpz (.tar) apk (.tar) 1F 8B 08
0 GPT gpt mbr offset=512 E F I 20 P A R T 00 00 01 00
0 M HFS hfs hfsx offset=1024 H + 00 04 || H X 00 05
0 O IHex ihex
0 Iso iso img offset=32769 C D 0 0 1
0 Lzh lzh lha offset=2 - l h
0 K O lzma lzma
0 K lzma86 lzma86
0 M E MachO macho CE FA ED FE || CF FA ED FE || FE ED FA CE || FE ED FA CF
0 P MBR mbr
0 MsLZ mslz S Z D D 88 F0 ' 3 A
0 M Mub mub CA FE BA BE 00 00 00 || B9 FA F1 0E
0 F G Nsis nsis offset=4 EF BE AD DE N u l l s o f t I n s t
0 NTFS ntfs img offset=3 N T F S 20 20 20 20 00
0 E PE exe dll sys M Z
0 E TE te V Z
0 Ppmd pmd 8F AF AC 84
0 QCOW qcow qcow2 qcow2c Q F I FB 00 00 00
0 F Rar rar r00 R a r ! 1A 07 00
0 F Rar5 rar r00 R a r ! 1A 07 01 00
0 Rpm rpm ED AB EE DB
0 Split 001
0 M SquashFS squashfs h s q s || s q s h || s h s q || q s h s
0 C M SWFc swf (~.swf) C W S || Z W S
0 K SWF swf F W S
0 C O LH tar tar ova offset=257 u s t a r
0 O Udf udf iso img offset=32768 01 C D 0 0 1
0 FM UEFIc scap BD 86 f ; v 0D 0 @ B7 0E B5 Q 9E / C5 A0 || 8B A6 < J # w FB H 80 = W 8C C1 FE C4 M || B9 82 91 S B5 AB 91 C B6 9A E3 A9 C F7 / CC
0 FM UEFIf uefif offset=16 D9 T 93 z h 04 J D 81 CE 0B F6 17 D8 90 DF || x E5 8C 8C = 8A 1C O 99 5 89 a 85 C3 - D3
0 VDI vdi offset=64 10 DA BE
0 G VHD vhd c o n e c t i x 00 00
0 VMDK vmdk K D M V
0 C SN LH wim wim swm esd ppkg M S W I M 00 00 00
0 Xar xar pkg xip x a r ! 00 1C
0 CK xz xz txz (.tar) FD 7 z X Z 00
0 Z z taz (.tar) 1F 9D
0 C FMG zip zip z01 zipx jar xpi odt ods docx xlsx epub ipa apk appx P K 03 04 || P K 05 06 || P K 06 06 || P K 07 08 P K || P K 0 0 P K
0 CK zstd zst tzstd (.tar) 0 x F D 2 F B 5 2 2 . . 2 8 00
0 CK lz4 lz4 tlz4 (.tar) 0 x 1 8 4 D 2 2 0 4 00
0 CK lz5 lz5 tlz5 (.tar) 0 x 1 8 4 D 2 2 0 5 00
0 CK lizard liz tliz (.tar) 0 x 1 8 4 D 2 2 0 6 00
Codecs:
0 ED 40202 BZip2
0 4ED 303011B BCJ2
0 ED 3030103 BCJ
0 ED 3030205 PPC
0 ED 3030401 IA64
0 ED 3030501 ARM
0 ED 3030701 ARMT
0 ED 3030805 SPARC
0 ED 20302 Swap2
0 ED 20304 Swap4
0 ED 0 Copy
0 ED 40109 Deflate64
0 ED 40108 Deflate
0 ED 3 Delta
0 ED 21 LZMA2
0 ED 21 FLZMA2
0 ED 30101 LZMA
0 ED 30401 PPMD
0 ED 6F10701 7zAES
0 ED 6F00181 AES256CBC
0 ED 4F71101 ZSTD
0 ED 4F71104 LZ4
0 ED 4F71102 BROTLI
0 ED 4F71105 LZ5
0 ED 4F71106 LIZARD
Hashers:
0 32 202 BLAKE2sp
0 4 1 CRC32
0 20 201 SHA1
0 32 A SHA256
0 8 4 CRC64
0 16 205 MD2
0 16 206 MD4
0 16 207 MD5
0 48 208 SHA384
0 64 209 SHA512
0 4 203 XXH32
0 8 204 XXH64
"""