Zip file compression
by RS  admin@robinsnyder.com : 1024 x 640


1. Zip file compression
Data in many applications, such as data science applications, are often created, distributed, and used as a collection of many folders and files compressed into a single file in the ZIP file format.

2. Compression

3. Space and time
To compress something is to make it take up less space or time.

4. Physical objects
Can we compress a car? We can compress a car and it will take less space. But it is not easy to recover the original car.

5. Data
We can compress data and then recover the data.
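As a minimal sketch of this round trip, the following uses Python's standard zlib module. The sample text is made up for illustration.

```python
import zlib

# Made-up sample data with plenty of repetition.
original = b"the quick brown fox jumps over the lazy dog. " * 20

compressed = zlib.compress(original)    # make the data smaller
restored = zlib.decompress(compressed)  # recover the original bytes

print(len(original), len(compressed))   # original size, compressed size
print(restored == original)             # the recovered data is identical
```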

Some students would like it if the class were compressed so that they could get out early.

6. Internet downloads
Internet downloads are quicker using compression: the data is compressed before it is sent and decompressed after it arrives. It does not work well to download a pizza!

7. Zipped files
Compressed files are sometimes called zipped files, a term derived from the PKZIP software program. You unzip a zipped file to decompress it.

Operating systems have built-in decompression software for zip files.

8. Types of compression
There are two types of compression: lossless and lossy.

9. Lossless compression
In lossless compression, nothing is lost: the original can be recovered exactly. Image examples: *.bmp files, *.gif files.

10. Lossy compression
In lossy compression, some information is lost and the original cannot be recovered exactly. Image example: *.jpg files. Audio example: *.mp3 files (although the uncompressed *.wav format can be easier to work with in some applications).
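A toy sketch of the lossy idea follows. It is not how JPEG or MP3 actually work; it only shows that once values are rounded to a coarser scale, the originals cannot be recovered. The sample values and the step size are assumptions chosen for illustration.

```python
# Hypothetical 8-bit sample values.
samples = [0, 17, 34, 250, 255]
STEP = 16  # assumed quantization step: keep only 16 levels instead of 256

# "Compress": each value now needs 4 bits instead of 8.
quantized = [s // STEP for s in samples]

# "Decompress": the best we can do is return to the start of each level.
restored = [q * STEP for q in quantized]

print(quantized)
print(restored)
print(restored == samples)  # the original values are not recovered
```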

Zip files are a way to combine and compress multiple folders and files into one file.

11. Zip codes
A zip file is unrelated to a USPS (United States Postal Service) ZIP (Zone Improvement Plan) code.

12. Zipper

13. Mozilla
Mozilla applications such as Firefox (browser) and Thunderbird (email client) store their add-ons in the form of ZIP files with the extension xpi for XPI (Cross-Platform Install).

14. OpenXML
Files in OpenXML can be the source of textual data for data science.

OpenXML is an open XML standard for representing textual data in the form of documents (docx), spreadsheets (xlsx), presentations (pptx), etc. These replaced the old proprietary binary file formats (doc, xls, ppt). Each file is a collection of folders and files in ZIP file format.

15. Other systems
The following files are renamed zip files.

16. Files and folders
A traditional file system contains folders and files. One navigates the tree-structured directories of folders to organize and find files.

17. SQL
An SQL (Structured Query Language) database is a relational database designed to minimize redundancy of data using relations and normal forms.

18. NoSQL

19. Redis
"Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams. Redis has built-in replication, Lua scripting, LRU eviction, transactions and different levels of on-disk persistence, and provides high availability via Redis Sentinel and automatic partitioning with Redis Cluster." (Redis web site)

Redis (Remote Dictionary Server) is a form of NoSQL database.

Redis popularized the idea of a system that can be considered at the same time a store and a cache, using a design where data is always modified and read from the main computer memory, but also stored on disk in a format that is unsuitable for random access of data, but only to reconstruct the data back in memory once the system restarts. (Wikipedia)

20. Usage
Redis is a key-value in-memory database often used to improve web performance. I have used Redis to improve the network performance of distributed processing of problems (e.g., legal work, topic modeling, etc.) where 10 to 20 or more machines work on the same problem at the same time (needed for the legal work; this can be scaled up using small credit-card-sized computer boards).

21. AWS
AWS (Amazon Web Services) provides cloud services and storage in NoSQL form.

In such a form, what appear to be folders and files are just files where the folder separator, such as "\" or "/", is just a character in a name. The navigation provided is illusory and one needs to be aware of that. Complex pattern matching (regular expressions, etc.) is used to find and search for files, etc.
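The idea can be sketched without any cloud service at all. In the following, a plain Python list stands in for a flat store of names; the key names and the helper list_prefix are hypothetical, and no AWS API is used.

```python
import re

# A flat collection of names. The "/" is just a character in each name.
keys = [
    "reports/2023/summary.txt",
    "reports/2024/summary.txt",
    "reports/2024/data.csv",
    "images/logo.png",
]

def list_prefix(all_keys, prefix):
    """'Open a folder' by keeping the names that start with a prefix."""
    return [k for k in all_keys if k.startswith(prefix)]

# What looks like navigating into reports/2024/ is a string comparison.
in_2024 = list_prefix(keys, "reports/2024/")

# A regular expression finds csv files one "level" below any four-digit year.
csv_pattern = re.compile(r"^reports/\d{4}/[^/]+\.csv$")
csv_keys = [k for k in keys if csv_pattern.match(k)]

print(in_2024)
print(csv_keys)
```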

Relational database concepts and thinking are still valuable, but not directly supported. One needs to be aware of this.

22. Zip file format
The format for folders and files in a ZIP file is similar to NoSQL in that what appears to be a folder and file structure is illusory. It is a flat file system that appears to be hierarchical.

Packages provide ways to think about the folders and files in a hierarchical manner. One needs to be aware of this when programming ZIP file adds and reads.

In a zip file, only files are stored. Folders are just a prefix of text in a file name. In some cases, special handling is needed to include folders in which there is no content.
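A small sketch with the standard zipfile module shows the flat list of names. The archive is built in memory and the entry names are made up. Writing a name that ends in "/" is one way to record a folder that has no content.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory stand-in for a zip file on disk
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("DATA/a.txt", "alpha")
    zf.writestr("DATA/sub/b.txt", "beta")
    zf.writestr("EMPTY/", "")  # a name ending in "/" records an empty folder

with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    folders = [info.filename for info in zf.infolist() if info.is_dir()]

print(names)    # a flat list of names; there is no separate "DATA/" entry
print(folders)  # only the explicitly added empty folder
```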

23. Compression
A zip file can contain compressed files. If the compression ratio is large, there is a lot of redundancy in the file.
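The link between redundancy and compression ratio can be sketched with the standard zlib module. The two inputs are made up: one is a single byte repeated, the other is random bytes with essentially no redundancy. The helper name ratio is hypothetical.

```python
import os
import zlib

redundant = b"A" * 10_000          # highly redundant data
random_bytes = os.urandom(10_000)  # essentially no redundancy

def ratio(data):
    """Original size divided by compressed size."""
    return len(data) / len(zlib.compress(data))

print(round(ratio(redundant), 1))     # large ratio
print(round(ratio(random_bytes), 3))  # about 1: almost nothing to remove
```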

24. Python zip file processing
Python can be used to both create and read zip files.

Python's zipfile module can read zip files protected with the traditional ZIP password encryption, but it cannot (without third-party support) create zip files with encryption.

In most cases, calling an external program such as 7-Zip can provide a way to create password-encrypted zip files.
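One possible sketch of calling an external program follows. It assumes the 7-Zip command-line program is installed and reachable as 7z, which may differ on a given machine. The function names, file names, and password are hypothetical. The command uses the 7-Zip switches a (add), -tzip (zip format), and -p (password), and leaves the encryption method at the tool's default.

```python
import subprocess

def build_7z_command(archive, password, paths, exe="7z"):
    """Build the argument list for adding files to a password-protected zip."""
    # a = add to archive, -tzip = use the zip format, -p<password> = set password
    return [exe, "a", "-tzip", f"-p{password}", archive, *paths]

def create_encrypted_zip(archive, password, paths):
    """Run 7-Zip; raises CalledProcessError if the program reports a failure."""
    subprocess.run(build_7z_command(archive, password, paths), check=True)

# Show the command without running it.
command = build_7z_command("secret.zip", "example-password", ["report.txt"])
print(command)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when a file name or password contains spaces.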

25. Read a zip file
Python can be used to read a zip file.

The zipfile module is required (it is part of the Python standard library).

In a previous Python example, a docx was created. This docx file is read and the folders and files displayed. Here is the Python code [#1]

Here is the output of the Python code.
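As a rough sketch of this kind of reader (not the original listing), the following lists each entry of a zip file with its original and compressed sizes. To keep the sketch self-contained, it builds a tiny stand-in archive in memory. The stand-in borrows two docx-style entry names but is not a valid docx file, and the helper name list_entries is hypothetical.

```python
import io
import zipfile

def list_entries(source):
    """Return (name, original size, compressed size) for each entry."""
    with zipfile.ZipFile(source) as zf:
        return [(i.filename, i.file_size, i.compress_size) for i in zf.infolist()]

# Stand-in archive built in memory (not a valid docx file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("word/document.xml", "<w:p>hello</w:p>" * 200)

for name, size, packed in list_entries(buffer):
    print(f"{name:25} {size:6} -> {packed:6}")
```

For a real file, a path can be passed in place of the buffer, as in list_entries("example.docx"), where the file name is an assumption.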


26. Compression
Note that the amount of compression varies from file to file.

A large amount of compression means that the original file had a lot of redundancy in it.

27. OpenXML
Note: The OpenXML file format is very complex in that there are many internal references and cross-references that need to be exact.

Example: In PowerPoint, if any reference or cross-reference is not correct, the presentation will be corrupted and may not load at all. These formats are not something that one would manually create.

28. Create and read a zip file
Python can be used to create a zip file.

For compression, the zlib module is needed (it is part of the Python standard library).

The following program creates a zip file and then opens and reads the zip file just created. Instead of dynamically creating a file and then adding it, the writestr method is used to write text directly (as if it were a file). Here is the Python code [#2]

Here is the output of the Python code.
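A minimal sketch of the same approach (not the original listing) writes text with writestr and reads it back. The archive is kept in memory, and the entry names and text are made up.

```python
import io
import zipfile

buffer = io.BytesIO()  # passing a path such as "example.zip" would write to disk

# Create: ZIP_DEFLATED asks for compression, which relies on zlib.
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("DATA/hello.txt", "Hello, zip file!\n")
    zf.writestr("DATA/numbers.txt", "\n".join(str(n) for n in range(100)))

# Read: map each entry name to its decoded text.
with zipfile.ZipFile(buffer) as zf:
    contents = {name: zf.read(name).decode("utf-8") for name in zf.namelist()}

for name, text in contents.items():
    print(name, len(text))
```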


29. Zip file
Here is how the created zip file appears in WinZip with the "DATA" folder expanded.

30. Notes
Note: For very small amounts of text, compression may actually increase the number of bytes needed.
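A quick sketch with the standard zlib module illustrates the point. The two-byte input is made up; the compressed form carries a header and a checksum that outweigh any savings.

```python
import zlib

tiny = b"hi"
packed = zlib.compress(tiny)

# The compressed form is longer than the two bytes it represents.
print(len(tiny), len(packed))
```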

Note: There are many other options available for time and date stamps, comments, etc.
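As one sketch of such options, the standard zipfile module lets a ZipInfo object carry a date and time stamp and a per-file comment, and the archive itself can carry a comment. The names, date, and comments below are made up.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    # (year, month, day, hour, minute, second); zip stores seconds in steps of 2.
    info = zipfile.ZipInfo("DATA/note.txt", date_time=(2020, 1, 2, 3, 4, 6))
    info.compress_type = zipfile.ZIP_DEFLATED
    info.comment = b"per-file comment"
    zf.writestr(info, "some text")
    zf.comment = b"archive comment"  # written when the archive is closed

with zipfile.ZipFile(buffer) as zf:
    stored = zf.getinfo("DATA/note.txt")
    archive_comment = zf.comment

print(stored.date_time, stored.comment, archive_comment)
```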

31. End of page
