Best way to program combinations?
Hi!
What is the most efficient approach when you want to work with combinations in a program? And which programming language?
Is a database the fastest option?
Unfortunately I hardly know anything about this…
Let me briefly outline what I want to do.
For example, 100 text files are to be loaded, each containing roughly 300,000 lines with numbers. In total there are only 500,000 distinct numbers. Now I want to merge 15 files at a time.
If, say, 6 files already reach the full count of 500,000, no further combining is done there.
This way I want to find out which combinations of 15 files stay below 500,000.
I hope it's understandable in this stripped-down form 😁
Best regards!
So I'll take a stab at what you mean…
You have files containing the numbers, with 300,000 distinct numbers across all files. You read a file (File_1) and determine which unique numbers it contains. Then you read the next one (File_2) and determine which unique numbers it contains that are not already in File_1… You continue until File_1, …, File_n together contain all the distinct numbers. File_1, …, File_n then form a group of files. Then the same game starts again with File_(n+1), File_(n+2), and so on.
Many languages can do this, for example Python.
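The sequential grouping described above can be sketched in a few lines of Python (file names and the total count of distinct numbers are assumptions; this is a sketch, not a tuned implementation):

```python
def group_files(filenames, total_distinct):
    """Walk the files in order and start a new group whenever the
    numbers collected so far cover all distinct values."""
    groups, current, seen = [], [], set()
    for name in filenames:
        with open(name) as f:
            seen.update(line.strip() for line in f if line.strip())
        current.append(name)
        if len(seen) >= total_distinct:   # group is complete
            groups.append(current)
            current, seen = [], set()
    if current:                           # trailing, incomplete group
        groups.append(current)
    return groups
```

A `set` makes the uniqueness test cheap; 500,000 short strings fit comfortably in memory.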
Yes, about 300,000; it could be a bit more or less.
The rest fits.
How do you build this? The number of combinations runs into the umpteen billions!
Why? 1–15 form a group, 16–30 the second, then 31–45, 46–60, 61–75, 76–90, 91–100. That gives me only 7 result files with at most 500,000 lines each. Nothing special, and it runs through in under a minute.
Or you expressed it wrong and mean something else.
So, if you want the groups of 15 files to be optimal in the sense that each contains the maximum number of distinct numbers (some complete, others incomplete), then you need to build an index. The index records, for each number, which files it occurs in. This lets you quickly answer the question of how many complete file groups you can create at most. You can also search "fail first": take the rarest number, take the first file that contains it, and then find the file with the least overlap. You continue like this until the group is full, then take the next unused file in the list.
This might lead to a "good" solution. Overall, I suspect the problem is NP-hard.
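A minimal sketch of that "fail first" heuristic, assuming each file's numbers have already been read into a set (the name `fail_first_group` and the data layout are made up for illustration):

```python
def fail_first_group(files, group_size=15):
    """files: dict mapping file name -> set of numbers it contains.
    Greedy heuristic: start from a file covering the rarest number,
    then repeatedly add the file contributing the most new numbers."""
    index = {}                            # number -> files containing it
    for name, nums in files.items():
        for n in nums:
            index.setdefault(n, set()).add(name)
    rarest = min(index, key=lambda n: len(index[n]))
    first = min(index[rarest])            # first file that contains it
    group, covered = [first], set(files[first])
    remaining = set(files) - {first}
    while remaining and len(group) < group_size:
        # "least overlap" = most numbers not yet covered
        best = max(remaining, key=lambda f: len(files[f] - covered))
        group.append(best)
        covered |= files[best]
        remaining.discard(best)
    return group, covered
```

As the post says, this is only a heuristic: it tends to find a good group quickly, but it does not guarantee an optimal one.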
Yeah, 15 out of 100 is anything but trivial. That's why you need to be smarter about it, and why you should sit down with someone who can program and who understands the problem you're trying to solve.
It's about combinations, not fixed groups; 1–15 was just an example.
After combining 1–15 comes 1–14 + 16, then 1–14 + 17, etc.
But all of that is skipped once 1–10 are already full.
It becomes more interesting once you have an optimality criterion. For that you'd need an index.
trivial
I wouldn't bother converting [01][02][43] into numbers; simply sorting the strings is enough. You only want to know how many elements are left anyway…
Millions of data lines from files is a hefty amount in any language. What the program does in memory is less of a problem than reading the files!
Iterate over the desired files
…the second buffer can then be written to a file of your choice, or whatever.
to get a block of 15 files
Instead of useless pseudocode…
Quick & dirty in PowerShell (Get-Content is not exactly a racehorse when reading large files, but clearer for a demo):
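For comparison, the same quick-and-dirty merge of one block of 15 files sketched in Python (file names are placeholders; duplicates are dropped on the fly):

```python
def merge_block(filenames):
    """Concatenate the lines of the given files, keeping each
    value only once, in first-seen order."""
    seen = set()
    merged = []
    for name in filenames:
        with open(name) as f:
            for line in f:
                value = line.strip()
                if value and value not in seen:
                    seen.add(value)
                    merged.append(value)
    return merged
```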
I believe that if you can formulate your problem in a comprehensible way, the solution is no longer far away.
So 300,000 numbers per file? By what criteria should 15 files be merged? And what for?
They are simply joined together, as if you concatenated all the rows of all 15 files into one; duplicate rows are dropped.
Whatever you want…
So you take file 1 with its 300,000 numbers, add from file 2 all the numbers that aren't already in it, then file 3, etc., until you have either used file 15 or the target file contains 500,000 numbers? If that's reached at file 10, files 11–15 are ignored?
Do you have any requirements for the format of the target file? Or does it just contain the numbers?
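The procedure just described might look like this in Python (the limit and the file list are assumptions; the loop stops as soon as the target set is full, so later files are ignored):

```python
def merge_until_full(filenames, limit=500_000, max_files=15):
    """Merge up to max_files files, stopping early once the target
    holds `limit` distinct numbers (e.g. full at file 10 -> 11-15 skipped)."""
    target = set()
    for name in filenames[:max_files]:
        with open(name) as f:
            target.update(line.strip() for line in f if line.strip())
        if len(target) >= limit:
            break
    return target
```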
Yes, as you described.
The target file should just contain the numbers, i.e. whatever is in the lines.
The numbers don't run from 10 to 500,000 but e.g. from 1 million to 5 million. The set of numbers is specified by a file. So it would be best if an additional file serves as a template for what the "frame" is.
That effectively sets the frame of 500,000, so it's better to work at the line level.
I still don’t know what you want.
You obviously have a set D = {D_1, D_2, …} of files, each of which can be viewed as a set D_i of lines. You know the set Z of all possible lines. And now you're looking for groups of 15 files whose union is not Z. You need:
I would process such amounts of data in Python and maintain, for each line, the set of files that contain this line. These sets can be represented efficiently as integers, each bit corresponding to one file:
If, after that, some line occurs in at most 85 (= 100 − 15) files, then any 15 files that don't contain it are a possible solution. Otherwise (if every line occurs in at least 86 files) there is no solution:
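A small sketch of that representation (the files' lines are assumed to be already loaded into lists; `build_index` is a made-up name):

```python
def build_index(file_lines):
    """Map each line to an integer bitmask: bit i is set exactly
    when the line occurs in file_lines[i]."""
    index = {}
    for i, lines in enumerate(file_lines):
        bit = 1 << i
        for line in set(lines):           # count each file once
            index[line] = index.get(line, 0) | bit
    return index
```

Bitmasks reduce the per-line bookkeeping to a single integer, and set operations on files become fast bitwise AND/OR.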
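That feasibility check could look like this, reusing a line-to-bitmask index as described above (function name and parameters are illustrative):

```python
def find_solution(index, total_files, group_size=15):
    """If some line occurs in at most total_files - group_size files,
    the files *not* containing it yield a solution; otherwise None."""
    threshold = total_files - group_size
    for line, mask in index.items():
        if bin(mask).count("1") <= threshold:   # popcount
            absent = [i for i in range(total_files)
                      if not mask & (1 << i)]
            return absent[:group_size]          # any group_size of these works
    return None
```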
Enumerating all possible groups is a bit more difficult. In principle, every group of 15 files (as above) counts once for each line missing from it, but groups are counted multiple times if two or more lines are missing from their union. It is certainly not practical to remember all previously generated groups in order to avoid duplicates, because there are far too many.
In principle, you can ignore all lines that occur in 86 or more files (because those occur in every group's union anyway), and then sort out the files that lie in the intersection of the remaining lines' file sets. This should reduce the amount of data considerably.
However, how exactly to proceed depends on whether you want to distinguish two different file groups that have the same union: